Using Natural Language Processing to Enable In-depth Analysis of Clinical Messages Posted to an Internet Mailing List: A Feasibility Study

Authors

  • Amit Acharya
  • Qing Zeng
  • Alla Keselman
  • Tanja Bekhuis
  • Marcos Kreinacke
  • Heiko Spallek
  • Mei Song
  • Jean A O'Donnell
Abstract

BACKGROUND: An Internet mailing list may be characterized as a virtual community of practice that serves as an information hub with easy access to expert advice and opportunities for social networking. We are interested in mining messages posted to a list for dental practitioners to identify clinical topics. Once we understand the topical domain, we can study dentists' real information needs and the nature of their shared expertise, and can avoid delivering useless content at the point of care in future informatics applications. However, a necessary first step involves developing procedures to identify messages that are worth studying given our resources for planned, labor-intensive research.

OBJECTIVES: The primary objective of this study was to develop a workflow for finding a manageable number of clinically relevant messages from a much larger corpus of messages posted to an Internet mailing list, and to demonstrate the potential usefulness of our procedures for investigators by retrieving a set of messages tailored to the research question of a qualitative research team.

METHODS: We mined 14,576 messages posted to an Internet mailing list from April 2008 to May 2009. The list has about 450 subscribers, mostly dentists from North America interested in clinical practice. After extensive preprocessing, we used the Natural Language Toolkit to identify clinical phrases and keywords in the messages. Two academic dentists classified collocated phrases in an iterative, consensus-based process to describe the topics discussed by dental practitioners who subscribe to the list. We then consulted with qualitative researchers regarding their research question to develop a plan for targeted retrieval. We used selected phrases and keywords as search strings to identify clinically relevant messages and delivered the messages in a reusable database.

RESULTS: About half of the subscribers (245/450, 54.4%) posted messages. Natural language processing (NLP) yielded 279,193 clinically relevant tokens or processed words (19% of all tokens). Of these, 2.02% (5634 unique tokens) represent the vocabulary for dental practitioners. Based on pointwise mutual information score and clinical relevance, 325 collocated phrases (eg, fistula filled obturation and herpes zoster) with 108 keywords (eg, mercury) were classified into 13 broad categories with subcategories. In the demonstration, we identified 305 relevant messages (2.1% of all messages) over 10 selected categories with instances of collocated phrases, and 299 messages (2.1%) with instances of phrases or keywords for the category systemic disease.

CONCLUSIONS: A workflow with a sequence of machine-based steps and human classification of NLP-discovered phrases can support researchers who need to identify relevant messages in a much larger corpus. Discovered phrases and keywords are useful search strings to aid targeted retrieval. We demonstrate the potential value of our procedures for qualitative researchers by retrieving a manageable set of messages concerning systemic and oral disease.
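The abstract mentions scoring collocated phrases by pointwise mutual information and then using selected phrases and keywords as search strings. The following is only a minimal sketch of that idea in plain Python, not the authors' Natural Language Toolkit pipeline: the function names (`tokenize`, `pmi_bigrams`, `find_messages`), the `min_count` threshold, and the toy messages are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def pmi_bigrams(messages, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated from raw counts over the whole corpus. Pairs seen fewer
    than min_count times are skipped (an assumed, illustrative cutoff).
    """
    unigrams = Counter()
    bigrams = Counter()
    for msg in messages:
        tokens = tokenize(msg)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def find_messages(messages, phrases):
    """Return indices of messages that contain any phrase as a whole-token sequence."""
    hits = []
    for i, msg in enumerate(messages):
        padded = " " + " ".join(tokenize(msg)) + " "
        if any(f" {phrase} " in padded for phrase in phrases):
            hits.append(i)
    return hits


# Invented toy messages, not taken from the study corpus.
messages = [
    "Patient presented with herpes zoster on the left side.",
    "Any advice on herpes zoster and the timing of treatment?",
    "The meeting is on the first Monday; the agenda is attached.",
    "Is the mercury in amalgam a concern for the patient?",
]

ranked = pmi_bigrams(messages)
selected = find_messages(messages, ["herpes zoster", "mercury"])
```

In a workflow like the one described, the ranked pairs would presumably be reviewed by human classifiers before any of them are used as search strings; the sketch skips that step and passes the phrases in by hand.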

Similar resources

A Question Answer System Based on Confirmed Knowledge Developed by Using Mails Posted to a Mailing List

In this paper, we report a QA system that can answer how-type questions based on a confirmed knowledge base developed by using mails posted to a mailing list. We first discuss a problem of developing a knowledge base by using natural language documents: wrong information in natural language documents. Then, we describe a method of detecting wrong information in mails posted to a ma...

Examination of Authors' Stylistic Elements of Electronic Messages based on Researched Studies

Identifying the author of a text is an important issue in natural language processing and text classification. It reveals the author's characteristics across various texts. With the rapid development of the Internet, Web-based tools that allow an anonymous identity, such as email and blogs, have become a popular method of communication for perpetrators. Moreover, this creates some specific security issues. In this paper, we ...

Internet Outages, the Eyewitness Accounts: Analysis of the Outages Mailing List

Understanding network reliability and outages is critical to the “health” of the Internet infrastructure. Unfortunately, our ability to analyze Internet outages has been hampered by the lack of access to public information from key players. In this paper, we leverage a somewhat unconventional dataset to analyze Internet reliability—the outages mailing list. The mailing list is an avenue for net...

Confirmed Knowledge Acquisition Using Mails Posted to a Mailing List

In this paper, we first discuss a problem of developing a knowledge base by using natural language documents: wrong information in natural language documents. It is almost inevitable that natural language documents, especially web documents, contain wrong information. As a result, it is important to investigate a method of detecting and correcting wrong information in natural language documents...

A Critical Functional Approach to Educational Discourses of Students and Professors over the Internet Context

This paper investigated the ways Iranian B.A. and M.A. students of English language and their professors represent themselves linguistically in their e-mails in general, and the ways they construct and negotiate power with regard to social and cultural norms in particular. It examined 84 e-mail messages that students and professors exchanged in the 2012-2013 academic year through Halliday's Systemic Funct...

Volume: 13

Publication date: 2011